Abstract

We present a statistical similarity measuring and clustering tool,
SimFinder, that organizes small pieces of text from one or multiple
documents into tight clusters. By placing highly related text units
in the same cluster, SimFinder enables a subsequent content
selection/generation component to reduce each cluster to a single
sentence, either by extraction or by reformulation. We report on
improvements in the similarity and clustering components of
SimFinder, including a quantitative evaluation, and establish the
generality of the approach by interfacing SimFinder to two very
different summarization systems.

1  Introduction 

Summarization is an application that cuts across multiple natural
language processing areas (search, text analysis, planning,
generation) and for which disparate approaches have been used,
including word counts (Luhn, 1958), information retrieval based
similarity measures (Salton et al., 1997), statistical models
(Kupiec et al., 1995), positional information (Lin and Hovy, 1997),
and discourse structure (Marcu, 1997). For multidocument
summarization, where the source texts often contain the same
information with variations in the presentation, an alternative
approach is to explicitly seek similar pieces of the input text, on
the assumption that recurring text units are probably the more
central ones. Each set of similar text pieces can then produce one
sentence in the summary, either by extraction or by reformulation.

In this paper, we present a statistical similarity and clustering
tool that accomplishes the task of finding similar text units
(sentences or paragraphs) for summarization, a task that is a
component of multiple current multidocument summarization systems
(Mani and Bloedorn, 1997; Carbonell and Goldstein, 1998; McKeown et
al., 1999; Radev et al., 2000). Our tool, SimFinder, incorporates
linguistic features and a sophisticated clustering algorithm to
construct sets of highly similar sentences or paragraphs for
summarization. In earlier work (Hatzivassiloglou et al., 1999), we
discussed how SimFinder’s text features were selected and evaluated;
we summarize these results in Section 2, along with recent
improvements on feature selection and weighting. In Section 3, we
discuss the clustering algorithm we adopted and modifications to it
specific to the summarization task. Finally, in Section 4 we
demonstrate the flexibility and generality of the approach by
showing how clusters of paragraphs produced by SimFinder are used in
two summarization systems at our institution, MultiGen and
Centrifuser. We present three implemented techniques for the summary
generation task, and outline several other possibilities.

Our focus in the present paper is on describing the incremental but
significant improvements in SimFinder’s features, machine learning
model, and clustering algorithm over the earlier 1999 version
(Hatzivassiloglou et al., 1999); and on offering evidence of the
approach’s generality by showing how SimFinder was successfully
interfaced with different (both extraction-based and
reformulation-based) content selection and presentation systems for
summarization.

2  Using  Machine  Learning  to  Compute 
Similarities 

Clustering entails both developing a similarity metric and choosing
an appropriate clustering algorithm. In clustering documents for
information retrieval purposes, as in the recent Topic Detection and
Tracking (TDT) efforts (Allan et al., 1998; Fiscus et al., 1999),
the similarity measure is usually based on shared words only. This
is often appropriate for classification of documents into topics,
although even for document-level clustering, the use of
linguistically informed features such as named entity tags can
improve performance (Hatzivassiloglou et al., 2000). However, we
have found that more specialized information can be utilized when we
have to work with smaller units of text (sentences or paragraphs)
and we want to put together only very similar units, as is the case
with summarization. In fact, SimFinder is designed to handle input
that has already been organized into groups of documents tightly
connected on topic and date, either with a separate TDT-like
clustering tool or because the input naturally comes in that form.


U.N. Human Rights Commissioner Mary Robinson made a landmark visit
to Mexico at the government's invitation after voicing alarm last
year of violence in the country's conflict-torn southern state of
Chiapas.

Mexico's government last year rejected suggestions the United
Nations might mediate in the long-running Chiapas conflict, saying
it could solve its own internal affairs. But it did invite Robinson
and a special rapporteur on extrajudicial killings to come and
assess human rights for themselves in the country.


Figure 1: Two similar paragraphs; in bold, we highlight the
primitive features indicating similarity that are captured by
SimFinder.

In our first presentation of SimFinder’s approach to summarization
(Hatzivassiloglou et al., 1999), we identified 43 features that we
could efficiently extract from the text and that could plausibly
help determine the semantic similarity of two short text units. We
chose to use paragraphs, rather than sentences, as our unit of text
in most experiments because a paragraph is more likely to contain
background information (such as proper nouns) relevant to semantic
comparison. Paragraphs in news documents often consist of a single
sentence in any case. Figure 1 illustrates some of these features by
means of two example similar paragraphs from our training corpus.


An OH-58 helicopter, carrying a crew of two, was on a routine
training orientation when contact was lost at about 11:30 a.m.
Saturday (9:30 p.m. EST Friday).

“There were two people on board,” said Bacon. “We lost radar contact
with the helicopter about 9:15 EST (0215 GMT).


Figure 2: A composite feature over word primitives, with the
restriction that one primitive must be a noun and one must be a
verb.


These paragraphs have quite a few words in common, including
government, last, year, and country. Perhaps more significantly,
they share several proper nouns: Robinson, Mexico, and Chiapas,
which perhaps should be weighted more for a match. Other
similarities include words with the same stem, such as invitation
and invite, and semantically related words such as killings and
violence. In all, our set of primitive features includes several
ways to define a match on a given word: we consider matches
involving identical words, as well as words that match on their
stem, as noun phrase heads ignoring modifiers, and as WordNet
(Miller et al., 1990) synonyms. These matches of primitive features
are further constrained by part of speech and combined to form
composite features attempting to capture syntactic patterns where
two primitive features have to match within a window of five words
(not including stopwords). In this manner, the composite features
approximate syntactic relationships such as subject-verb or
verb-object (see Figure 2). In other cases, a composite feature can
serve as a more effective version of a single primitive feature. For
example, Figure 3 illustrates a composite feature involving WordNet
primitives (i.e., words match if they share immediate hypernyms in
WordNet) and exact word match primitives. On its own, the WordNet
feature might introduce too much noise, but in conjunction with the
exact word match feature it can be a useful indicator of similarity.
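
As a rough illustration of how a composite feature of this kind
could be detected, the following Python sketch checks whether a noun
match and a verb match co-occur within a window of content words in
both text units. It is not SimFinder's implementation: the stopword
list, the toy NOUNS and VERBS lexicons (standing in for a
part-of-speech tagger), and the reading of "within a window of five
words" as "fewer than five positions apart" are all assumptions made
for the example.

```python
from typing import Callable, List, Tuple

# Hypothetical resources: a tiny stopword list and a toy lexicon that
# stands in for a real part-of-speech tagger.
STOPWORDS = {"a", "an", "the", "of", "at", "on", "with", "was", "we"}
NOUNS = {"contact", "helicopter", "radar"}
VERBS = {"lost"}

Match = Callable[[str, str], bool]


def content_words(text: str) -> List[str]:
    """Lowercase, strip surrounding punctuation, and drop stopwords."""
    tokens = [t.strip(".,;:!?\"'()").lower() for t in text.split()]
    return [t for t in tokens if t and t not in STOPWORDS]


def primitive_matches(a: List[str], b: List[str], same: Match) -> List[Tuple[int, int]]:
    """Index pairs (i, j) such that a[i] matches b[j] under `same`."""
    return [(i, j) for i, x in enumerate(a) for j, y in enumerate(b) if same(x, y)]


def composite_match(a: List[str], b: List[str], first: Match, second: Match,
                    window: int = 5) -> bool:
    """True if two distinct primitive matches lie fewer than `window`
    content words apart in both text units."""
    for i1, j1 in primitive_matches(a, b, first):
        for i2, j2 in primitive_matches(a, b, second):
            if i1 != i2 and j1 != j2 and abs(i1 - i2) < window and abs(j1 - j2) < window:
                return True
    return False


def same_noun(x: str, y: str) -> bool:
    return x == y and x in NOUNS


def same_verb(x: str, y: str) -> bool:
    return x == y and x in VERBS


unit_a = content_words("contact was lost at about 11:30 a.m. Saturday")
unit_b = content_words("We lost radar contact with the helicopter")
has_noun_verb_feature = composite_match(unit_a, unit_b, same_noun, same_verb)
```

On the fragments taken from Figure 2, the noun match on "contact" and
the verb match on "lost" fall close together in both units, so the
composite feature fires. Other primitives (stem, noun phrase head,
WordNet) would plug in as different `Match` functions.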

Boris Yeltsin was hospitalized Monday with what doctors suspect is
pneumonia, the latest sickness to beset the often ailing 68-year-old
Russian president.

Yeltsin has been hospitalized several times in the past three years,
usually with respiratory infections, including twice for pneumonia
in 1997 and 1998. The Kremlin tends to hospitalize the ailing
president at the first sign of illness.


Figure 3: A pair of paragraphs that contain a composite match; a
word match and a WordNet match (highlighted in bold) occur within a
window of five words, excluding stopwords.


For the purpose of automatic feature selection, we developed a data
set consisting of 10,535 manually marked pairs of paragraphs from
the Reuters part of the 1997 TDT pilot corpus. Each pair of
paragraphs was judged by two human reviewers, working separately.
The reviewers were asked to make a binary determination on whether
the two paragraphs contained “common information”. This was defined
to be the case if the paragraphs referred to the same object and the
object either (a) performed the same action in both paragraphs, or
(b) was described in the same way in both paragraphs. The reviewers
were then instructed to resolve each instance about which they had
disagreed. It is interesting to note here that in this and
subsequent annotation experiments we found significant disagreements
between the judges, and large variability in their rate of agreement
(kappa statistics (Carletta, 1996) between 0.08 and 0.82). The
disagreement was, however, significantly lower when the instructions
were as specific as the version above, and in any case annotators
were able to resolve their differences and come up with a single
label of similar or not similar when they conferred after producing
their individual judgments. As the above discussion illustrates, the
level of similarity that we represent in our training data and that
SimFinder tries to recover automatically is much more fine-grained
than in a typical information retrieval application; we are moving
from topical similarity down to the level of propositional content
similarity.
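
For readers unfamiliar with the kappa statistic used above to
quantify agreement, the following sketch computes it for two judges
making binary similar/not-similar decisions. It assumes the standard
two-rater Cohen formulation (observed agreement corrected by the
agreement expected from each judge's own label frequencies); the text
does not say which variant was used, and the labels below are
invented for illustration.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohen_kappa(labels_a: Sequence[Hashable], labels_b: Sequence[Hashable]) -> float:
    """Chance-corrected agreement between two annotators."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label sequences of equal length")
    n = len(labels_a)
    # Fraction of items on which the two judges agree.
    observed = sum(x == y for x, y in zip(labels_a, labels_b)) / n
    # Agreement expected if each judge labeled independently at random
    # according to their own label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in set(freq_a) | set(freq_b)) / n ** 2
    if expected == 1.0:
        return 1.0  # both judges used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Invented judgments for ten paragraph pairs (1 = similar, 0 = not similar).
judge_a = [1, 1, 0, 0, 0, 0, 1, 0, 0, 0]
judge_b = [1, 0, 0, 0, 0, 0, 1, 0, 1, 0]
kappa = cohen_kappa(judge_a, judge_b)
```

Here the judges agree on 8 of 10 pairs, but because "not similar" is
the majority label for both, 58% agreement is expected by chance, so
kappa is only about 0.52.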

We subsequently trained a classifier over both primitive and
composite features using Ripper (Cohen, 1996). Ripper produces a set
of ordered rules that can be used to judge any pair of paragraphs as
similar or non-similar. Using three-fold cross-validation over the
training data, Ripper included 11 of the 43 features in its final
set of rules and achieved 44.1% precision at 44.4% recall. The
eleven features were Word Overlap, Proper Noun Overlap, LinkIT (noun
phrase head) Overlap (Wacholder, 1998), Verb Overlap, Noun Overlap,
Adjective Overlap, WordNet Overlap, WordNet Verb Overlap, Verb
Overlap, WordNet Collocation, and Stem Overlap (see
(Hatzivassiloglou et al., 1999) for more details on the various
features). The selection of eleven features rather than just words
validates our claim that more than word matching is needed for
effective paragraph matching for summarization. This was also
verified experimentally; the standard TF*IDF measure (Salton and
Buckley, 1988), which bases similarity on shared words weighted
according to their frequency in each text unit and their rarity
across text units, yielded 32.6% precision at 39.1% recall. We also
measured the performance of a standard IR system on this task; the
SMART system (Buckley, 1985), which uses a modified TF*IDF approach,
achieved 34.1% precision at 36.7% recall. In all cases, we report
evaluation results at the point of the precision-recall curve where
precision and recall are closest, which is a summary metric
combining information on the two possible kinds of errors (as
11-point precision and F-measure also do). We did not have direct
access to the more recent information retrieval systems offering
improvements over SMART (e.g., the TDT2 and TDT3 systems) so that we
could apply them to paragraph-length text segments and directly
compare their performance to our method. However, such systems still
primarily use word matches for determining similarity, rely most
commonly on variants of TF*IDF, and are designed to operate on text
pieces much larger than sentences or paragraphs.
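
The F-measure mentioned above combines the two error types into one
number as the weighted harmonic mean of precision and recall. As a
small worked example, the sketch below applies the standard balanced
formula, F1 = 2PR / (P + R), to the three precision/recall pairs
quoted in this section; the function name and the `beta` parameter
are illustrative choices rather than anything taken from the
evaluation code.

```python
def f_measure(precision: float, recall: float, beta: float = 1.0) -> float:
    """Weighted harmonic mean of precision and recall (F1 when beta == 1)."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1.0 + b2) * precision * recall / (b2 * precision + recall)


# Precision and recall (in percent) as quoted in the text above.
reported = {
    "Standard TF*IDF": (32.6, 39.1),
    "SMART": (34.1, 36.7),
    "SimFinder with Ripper": (44.1, 44.4),
}

f1_scores = {name: round(f_measure(p, r), 1) for name, (p, r) in reported.items()}
# f1_scores == {"Standard TF*IDF": 35.6, "SMART": 35.4, "SimFinder with Ripper": 44.2}
```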

It is worth noting that 21 of the 43 original features were
normalized according to the matching primitives’ IDF scores (derived
from the number of documents in our collection in which they
appear). Ripper selected none of those features, which suggests that
TF*IDF is not an appropriate metric to use in evaluating similarity
between small text units in a system such as ours. This observation
makes sense given that in SimFinder the collection of documents from
which document frequency is calculated has been filtered by topic
and date. Thus, a primitive that would be rare in a large corpus
could have an abnormally high frequency in the relatively small set
of related documents on which SimFinder operates.

Since performing this evaluation, we have refined some of our
features and added new ones. We now take a more sophisticated view
of proper names, maintaining a list of previously seen proper name
forms and allowing for partial matches (i.e., implementing a limited
co-reference resolution component) so that multiple forms of the
same name can be collated. We have added filters eliminating some
categories of linking verbs and function words from our feature
counts, and incorporated a new feature that tracks whether two
paragraphs come from the same article (hypothesizing that highly
similar paragraphs are less likely to occur in the same article).
Finally, we have changed our machine learning approach to allow for
values of similarity in the full range between 0 and 1 rather than
the “yes”/“no” decisions that Ripper supports. Such real-valued
similarities enable the clustering component of SimFinder to give
higher weight to paragraph pairs that are more similar than others.


                                         Precision   Recall   F1-measure
Standard TF*IDF                            32.6%     39.1%      35.6%
SMART                                      34.1%     36.7%      35.4%
1999 SimFinder (with Ripper)               44.1%     44.4%      44.2%
2001 SimFinder (with log-linear model)     49.3%     52.9%      51.0%

Table 1: Evaluation scores for several similarity computation
techniques. The test data consisted of pairs of paragraphs from
closely related documents in the Reuters part of the 1997 TDT pilot
corpus, manually labeled as similar or not similar.

We use a log-linear regression model to convert the evidence from
the various features to a single similarity value. This is similar
to a standard regression model (i.e., a weighted sum of the
features) but properly accounts for the changes in the output
variance as we go from the normal to the binomial distribution for a
response between 0 and 1 (McCullagh and Nelder, 1989). A weighted
sum of the input features is used as an intermediate predictor, η,
which is related to the final response R via the logistic
transformation

$$R = \frac{e^{\eta}}{1 + e^{\eta}}$$


Via an iterative process, stepwise refinement, the log-linear model
automatically selects the input features that significantly increase
the predictive capability of the model, thus avoiding overlearning.
The model selected 7 input features, and resulted in a remarkable
increase in performance over the Ripper output (which itself offered
significant improvement over standard IR methods), to 49.3%
precision at 52.9% recall. Table 1 summarizes the evaluation scores
obtained by the different methods, while Figure 4 shows the
precision-recall curves corresponding to the log-linear version at
different cutoff thresholds for considering two paragraphs as
similar. As in the case of the Ripper model, the automatic selection
of multiple features in the log-linear model validates our
hypothesis that more than straightforward word matching is needed
for effectively detecting similarity between small pieces of text.


Figure 4: Precision and recall curves for the log-linear version of
SimFinder at various decision thresholds.

3  Clustering  Algorithm 

Once similarities between any two text units have been calculated,
we feed them to a clustering algorithm that partitions the text
units into clusters of closely related ones. This module was added
to SimFinder after our earlier publication of our approach to
similarities, replacing an earlier heuristic placeholder, and is
described in this paper for the first time. Once again we depart
from traditional IR algorithms, opting instead to use an algorithm
more appropriate to the summarization task’s requirements. In
Information Retrieval, hierarchical algorithms such as single-link,
complete-link, and groupwise-average, as well as online variants
such as single pass, are often used (Frakes and Baeza-Yates, 1992).
Compared to non-hierarchical techniques, such algorithms trade off
some of the quality of the produced clustering for speed (Kaufman
and Rousseeuw, 1990), or are sometimes imposed because of additional
requirements of the task (e.g., when documents must be processed
sequentially as they arrive). For summarization, however, the
distinctions between paragraphs are often fine-grained, and there
are usually far fewer related paragraphs to cluster than documents
in an IR application.

We have therefore adopted a non-hierarchical clustering technique,
the exchange method (Späth, 1985), which casts the clustering
problem as an optimization task and seeks to minimize an objective
function $\Phi$ measuring the within-cluster dissimilarity in a
partition $\mathcal{P} = \{C_1, C_2, \ldots, C_k\}$,

$$\Phi(\mathcal{P}) = \sum_{i=1}^{k} \left( \frac{1}{|C_i|} \sum_{\substack{x, y \in C_i \\ x \neq y}} d(x, y) \right)$$

where the dissimilarity $d(x, y)$ is one minus the similarity
between x and y.

The algorithm proceeds by creating an initial partition of the text
units that are to be clustered, and then looking for locally optimal
moves and swaps of text units between clusters that improve Φ, until
convergence is achieved. Since this is a hill-climbing method, the
algorithm is called multiple times from randomly selected starting
points, and the best overall configuration is selected as the final
result.
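
The procedure just described can be sketched as follows. This is a
simplified illustration rather than SimFinder's code: it considers
only single-unit moves (swaps are omitted for brevity), takes the
number of clusters k as given, keeps every cluster non-empty, and
assumes the objective is the sum over clusters of the pairwise
within-cluster dissimilarities divided by the cluster size. The
function names, the restart count, and the fixed random seed are all
choices made for the example.

```python
import random
from typing import List, Sequence, Tuple

Matrix = Sequence[Sequence[float]]  # symmetric dissimilarities, d[x][y] = 1 - similarity


def objective(clusters: List[List[int]], d: Matrix) -> float:
    """Within-cluster dissimilarity: per-cluster pair sum divided by cluster size."""
    total = 0.0
    for members in clusters:
        pair_sum = sum(d[x][y] for i, x in enumerate(members) for y in members[i + 1:])
        total += pair_sum / len(members) if members else 0.0
    return total


def to_clusters(assign: List[int], k: int) -> List[List[int]]:
    return [[u for u, c in enumerate(assign) if c == label] for label in range(k)]


def exchange_once(d: Matrix, k: int, rng: random.Random) -> Tuple[List[List[int]], float]:
    """One hill-climbing run from a random initial partition."""
    n = len(d)
    order = list(range(n))
    rng.shuffle(order)
    assign = [0] * n
    for pos, unit in enumerate(order):
        assign[unit] = pos % k  # round-robin: every cluster starts non-empty
    best = objective(to_clusters(assign, k), d)
    improved = True
    while improved:
        improved = False
        for u in range(n):
            src = assign[u]
            if assign.count(src) == 1:
                continue  # moving u would empty its cluster
            for dst in range(k):
                if dst == src:
                    continue
                assign[u] = dst  # tentatively move u
                score = objective(to_clusters(assign, k), d)
                if score < best - 1e-12:
                    best, src, improved = score, dst, True  # keep the move
                assign[u] = src  # restore, or confirm the accepted move
    return to_clusters(assign, k), best


def exchange_cluster(d: Matrix, k: int, restarts: int = 20,
                     seed: int = 0) -> Tuple[List[List[int]], float]:
    """Run the hill climber from several random starts and keep the best result."""
    if not 1 <= k <= len(d):
        raise ValueError("k must be between 1 and the number of text units")
    rng = random.Random(seed)
    runs = [exchange_once(d, k, rng) for _ in range(restarts)]
    return min(runs, key=lambda run: run[1])


# Toy input: units 0 and 1 are close, units 2 and 3 are close.
toy = [
    [0.0, 0.1, 0.9, 0.9],
    [0.1, 0.0, 0.9, 0.9],
    [0.9, 0.9, 0.0, 0.1],
    [0.9, 0.9, 0.1, 0.0],
]
clusters, score = exchange_cluster(toy, k=2, restarts=5)
```

Each accepted move strictly lowers the objective, so a run always
terminates; the random restarts are what guard against stopping in a
poor local optimum.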

We have further modified the clustering method to address some of
the characteristics of data sets in summarization applications. To
reduce the number of paragraphs considered for clustering, we impose
an adjustable threshold on the similarity values, ignoring paragraph
pairs for which the evidence of similarity is too weak. By adjusting
this threshold, we can have the system create small, high-quality
clusters or large, noisy clusters as needed. Since every paragraph
in that filtered set is similar to at least one other paragraph, we
impose an additional constraint on the clustering algorithm to never
produce singleton clusters.
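
The thresholding step can be pictured with the short sketch below,
which keeps only the pairs whose similarity reaches the threshold and
the text units that take part in at least one such pair. The
dictionary-of-pairs representation, the function name, and the sample
scores and threshold values are assumptions made for illustration.

```python
from typing import Dict, List, Tuple

Pair = Tuple[int, int]


def filter_by_threshold(similarity: Dict[Pair, float],
                        threshold: float) -> Tuple[List[int], Dict[Pair, float]]:
    """Keep pairs at or above the threshold and the units they connect."""
    links = {pair: s for pair, s in similarity.items() if s >= threshold}
    # A unit survives only if it is linked to at least one other unit.
    units = sorted({u for pair in links for u in pair})
    return units, links


# Invented similarity scores between five paragraphs.
scores = {(0, 1): 0.82, (0, 2): 0.15, (1, 2): 0.40, (2, 3): 0.66, (3, 4): 0.05}
strict_units, strict_links = filter_by_threshold(scores, threshold=0.6)
loose_units, loose_links = filter_by_threshold(scores, threshold=0.3)
```

Raising the threshold leaves fewer, stronger links (two here), while
lowering it admits weaker evidence (three links); paragraph 4 has no
sufficiently similar partner under either setting and is dropped
before clustering.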

We have also adopted a heuristic for estimating the number of
clusters for a given set of paragraphs. Since each cluster is
subsequently transformed into a single sentence of the final
summary, many small clusters would result in an overly lengthy
summary, while a few large clusters would result in a summary that
omits important information. We use information on the number of
links passing the similarity threshold between the clustered
paragraphs, interpolating the number of clusters between the number
of connected components in the corresponding graph (few clusters,
for very dense graphs) and half of the number of paragraphs (many
clusters, for very sparse graphs). In other words, the number of
clusters c for a set of n text units in m connected components is
determined as

$$c = m + \left(\frac{n}{2} - m\right)\left(1 - \frac{\log(L)}{\log(P)}\right)$$

where L is the observed number of links and P (= n(n − 1)/2) is the
maximum possible number of links. We use a non-linear interpolating
function to account for the fact that, usually, $L \ll P$.
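
A minimal sketch of this heuristic follows, assuming the
interpolation takes the form c = m + (n/2 − m)(1 − log L / log P),
which matches the endpoints described above (c = m when every
possible link is present, c approaching n/2 when links are scarce).
Rounding to the nearest integer, the guards for degenerate inputs,
and the union-find helper used to count connected components are
additions made for the example.

```python
import math
from typing import Iterable, List, Tuple


def connected_components(n: int, links: Iterable[Tuple[int, int]]) -> int:
    """Count connected components of an undirected graph with union-find."""
    parent: List[int] = list(range(n))

    def find(x: int) -> int:
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in links:
        parent[find(a)] = find(b)
    return len({find(x) for x in range(n)})


def estimate_num_clusters(n: int, links: List[Tuple[int, int]]) -> int:
    """Interpolate between m (dense graph) and n/2 (sparse graph)."""
    m = connected_components(n, links)
    observed = len(links)            # L in the text
    possible = n * (n - 1) // 2      # P in the text
    if observed == 0 or possible <= 1:
        return m  # degenerate input: nothing to interpolate
    c = m + (n / 2 - m) * (1 - math.log(observed) / math.log(possible))
    return max(1, round(c))  # rounding is an assumption of this sketch


# Six paragraphs linked in three separate pairs: as sparse as the
# no-singleton filtering allows, so c equals n/2 = m = 3.
sparse = estimate_num_clusters(6, [(0, 1), (2, 3), (4, 5)])

# Four paragraphs with every possible link: c collapses to m = 1.
dense = estimate_num_clusters(4, [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)])
```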

Partial  output  from  a  sample  clustering  of  news 
paragraphs  is  shown  in  Figure  5. 


4  From  Clusters  to  Summaries 

Clustering, as implemented in SimFinder, provides a flexible means
for organizing related information in a form that can be
subsequently turned into summaries of varying formats and
complexities. Each cluster captures information salient to a
particular facet of the input data, often a specific event, fact, or
opinion. As an initial step, key terms in each cluster can be
collected and used as an indicative, free-form summary (Witten et
al., 1999). Alternatively, one sentence or paragraph per cluster can
be selected, producing an extracted summary. These sentences can be
chosen using simple positional features (e.g., the sentence located
earliest in its source article) or as the centroids of their cluster
(Radev et al., 2000).

We report on two specific schemas for converting the clusters to
summary sentences that we have implemented at Columbia University,
which nevertheless do not exhaust the possibilities. Our focus in
this paper is primarily on establishing the usefulness of SimFinder
as a component of summarization systems using variable content
selection or generation back ends; hence, we do not discuss the back
end systems' operation in depth.


Cluster 1

MEXICO CITY (Reuters) - The United Nations’ human rights chief on
Wednesday said Mexico was taking steps to improve its rights
problems but was still failing to bring all those responsible for
abuses to justice.

MEXICO CITY (AP) - The top U.N. human rights official said Wednesday
that attacks on activists and faulty law enforcement were among
Mexico’s most serious human rights woes, but she applauded its
president for recognizing his country has such problems.


Cluster 2

“I was impressed that he (Zedillo) was not denying there were
difficulties,” she said.

“He (Zedillo) was very open about there being difficulties ... I was
impressed that he was not denying those difficulties,” Robinson told
reporters at a ceremony in which she and Mexican officials signed a
letter of understanding on rights promotion.


Figure 5: Automatically produced clusters of paragraphs (partial
clusters are shown).


The first of these summarization systems, Centrifuser, utilizes
SimFinder’s output in the medical domain. In the context of the
multidisciplinary Digital Library project at our site, we are
looking for ways to summarize multiple medical articles for either
patients or doctors. For patient-oriented summaries, Centrifuser
retrieves information from a number of online health resources to
increase coverage, but needs SimFinder to unify the information and
eliminate redundancy.

We take advantage of broad domain knowledge principles for the
organization of the summaries, presenting information on topics such
as diseases, diagnosis, and treatment separately. Centrifuser
stratifies the input data according to each such broad topical
class; calls SimFinder to organize the sentences within each topic
into clusters; and then picks one representative sentence from each
cluster to form the final summary. Two heuristics are used for the
sentence selection phase: clusters spread over multiple documents
are selected first, to ensure that in a summary of limited length
the most general information is included; and sentences near the
start of their documents are preferred, to minimize dangling
references. Figure 6 shows a summary about the heart condition
“angina” produced by Centrifuser out of five related documents, each
between 2,700 and 7,000 words long.
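
The two selection heuristics can be sketched as follows. This is an
illustrative reading, not Centrifuser's implementation: the
`Sentence` record, the `select_summary` function, the use of the
number of distinct source documents as the measure of "spread", and
the sample data are all assumptions.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Sentence:
    text: str
    doc_id: str
    position: int  # 0-based index of the sentence within its document


def select_summary(clusters: List[List[Sentence]], max_sentences: int) -> List[str]:
    """Pick one sentence per cluster, favoring widely attested clusters."""
    non_empty = [c for c in clusters if c]
    # Heuristic 1: clusters drawn from more distinct documents come first.
    # sorted() is stable, so ties keep their original order.
    ranked = sorted(non_empty, key=lambda c: len({s.doc_id for s in c}), reverse=True)
    # Heuristic 2: within a cluster, prefer the sentence nearest the start
    # of its source document, to reduce dangling references.
    return [min(c, key=lambda s: s.position).text for c in ranked[:max_sentences]]


# Invented clusters: the second spans three documents, the first only one.
sample = [
    [Sentence("A1", "docA", 5), Sentence("A2", "docA", 2)],
    [Sentence("B1", "docA", 7), Sentence("B2", "docB", 1), Sentence("B3", "docC", 3)],
]
short_summary = select_summary(sample, max_sentences=1)
full_summary = select_summary(sample, max_sentences=2)
```

With a budget of one sentence, only the cluster attested in three
documents contributes, and its earliest-positioned sentence is the
one chosen.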

Treatment is designed to prevent or reduce ischemia and minimize
symptoms. Angina attacks usually last for only a few minutes, and
most can be relieved by rest. Most often the discomfort occurs after
strenuous physical activity or an emotional upset. A doctor
diagnoses angina largely by a person’s description of the symptoms.
The underlying cause of angina requires careful medical treatment to
prevent a heart attack. Not everyone with ischemia experiences
angina. If you experience angina, try to stop the activity that
precipitated the attack.

Figure 6: Centrifuser output for “angina.”


The second approach for summary generation, MultiGen (Barzilay et
al., 1999), goes beyond sentence extraction into reformulation.
Summarization by extraction has a number of well-known undesired
effects (McKeown et al., 1999): sentences taken out of context often
include embedded phrases that are not salient enough for a summary,
may bias the summary towards a particular detail, and may create
dangling references and disfluencies. For example, picking any one
sentence from the cluster in Figure 7 results in the inclusion of
some unnecessary details. MultiGen analyzes the sentences in each
cluster produced by SimFinder and regenerates instead a new sentence
containing just the information common to almost all sentences in a
cluster. It operates in three phases: parsing the sentences in each
cluster with an existing statistical parser (Collins, 1996),
matching the central elements in the resulting dependency trees
(allowing for paraphrases), and finally generating a new sentence
from these matched elements. Regeneration can be achieved in two
ways: either by mapping the predicate-argument structure produced by
our matching algorithm to the functional representation expected by
FUF/SURGE (Elhadad, 1993; Robin, 1994) using additional constraints
on realization choice based on surface features in place of the
semantic or pragmatic ones typically used in sentence generation; or
by selecting a sentence from the cluster as a skeleton, and
modifying it to include only phrases matched across the entire
cluster while preserving the grammatical validity of the sentence.
While the first approach is more general and allows us to produce
more complex sentences, the second approach is robust in a noisy
environment. For the example of Figure 7, both techniques would
produce “The quake had a magnitude of 6.9”; note that this is
explicit in the first sentence, but expressed via paraphrases such
as had ~ registered in the other two.


The quake had a magnitude of 6.9, following an earthquake in the
same region in February which killed 2,300 people and left thousands
homeless.

The quake registered 6.9 on the Richter scale, centered in a remote
part of the country.

Contacted at his headquarters in northern Afghanistan, Abdullah said
he feared thousands of people may have died in the devastating quake
in northeastern Afghanistan, with a preliminary magnitude of 6.9.

(a)

The quake had a magnitude of 6.9.

(b)

Figure 7: (a) A SimFinder-produced cluster of similar sentences
where any one of them includes unnecessary details; (b) MultiGen
output for this cluster.

5  Conclusion 

We have presented developments in our similarity module and a
recently added clustering algorithm, which jointly form a flexible
tool for converting textual data into groups of related text units
that can be further reduced to single sentences in a summarization
system. We have demonstrated quantitative improvements in
performance when compared to earlier work and standard IR
techniques, and shown how the information that our system produces
can be used by some very different approaches to the final summary
production.

We are currently focusing on extending SimFinder to multilingual
features and increasing the robustness of its feature extraction
process. One way to adapt our similarity model for documents in
multiple languages is to re-examine the features used and select
those that can be extracted from languages with less developed NLP
tools than English. Resilience during translation is also a factor
(we are looking at proper names, for example, as one feature that we
expect would be easy to translate reliably even from minority
languages). At the same time, we are testing SimFinder’s portability
in yet another domain in cooperation with the University of
Massachusetts at Amherst (who provide their TDT system for initial
document clustering) and MITRE, compiling an additional 15,000
judgments on paragraph and sentence similarities for training and
evaluation. We are in the process of formally measuring the
effectiveness of the clustering component of SimFinder (Section 3)
relative to the more commonly used hierarchical clustering
techniques. We are also looking at ways to increase SimFinder’s
accuracy in discovering similarities that might be obscured by
additional information in one or more of the matching sentences or
paragraphs; for example, by clustering at the clause as well as the
sentence and paragraph levels, and by reducing the relative weight
of features in subordinate clauses.